Abstract



Most methods for decision-theoretic online learning are based on the Hedge algo- 
rithm, which takes a parameter called the learning rate. In most previous analyses 
the learning rate was carefully tuned to obtain optimal worst-case performance, 
leading to suboptimal performance on easy instances, for example when there ex- 
ists an action that is significantly better than all others. We propose a new way 
of setting the learning rate, which adapts to the difficulty of the learning prob- 
lem: in the worst case our procedure still guarantees optimal performance, but on 
easy instances it achieves much smaller regret. In particular, our adaptive method 
achieves constant regret in a probabilistic setting, when there exists an action that 
on average obtains strictly smaller loss than all other actions. We also provide a 
simulation study comparing our approach to existing methods. 

1 Introduction 

Decision-theoretic online learning (DTOL) is a framework to capture learning problems that proceed 
in rounds. It was introduced by Freund and Schapire [1] and is closely related to the paradigm of
prediction with expert advice [2, 3, 4]. In DTOL an agent is given access to a fixed set of K actions,
and at the start of each round must make a decision by assigning a probability to every action. Then 
all actions incur a loss from the range [0, 1], and the agent's loss is the expected loss of the actions 
under the probability distribution it produced. Losses add up over rounds and the goal for the agent 
is to minimize its regret after T rounds, which is the difference in accumulated loss between the 
agent and the action that has accumulated the least amount of loss. 

The most commonly studied strategy for the agent is called the Hedge algorithm [1, 5]. Its performance
crucially depends on a parameter $\eta$ called the learning rate. Different ways of tuning
the learning rate have been proposed, which all aim to minimize the regret for the worst possi- 
ble sequence of losses the actions might incur. If T is known to the agent, then the learning rate 
may be tuned to achieve worst-case regret bounded by $\sqrt{T \ln(K)/2}$, which is known to be optimal
as T and K become large [4]. Nevertheless, by slightly relaxing the problem, one can obtain
better guarantees. Suppose for example that the cumulative loss $L_T^*$ of the best action is known
to the agent beforehand. Then, if the learning rate is set appropriately, the regret is bounded by
$\sqrt{2 L_T^* \ln(K)} + \ln(K)$ [4], which has the same asymptotics as the previous bound in the worst case
(because $L_T^* \le T$) but may be much better when $L_T^*$ turns out to be small. Similarly, Hazan and
Kale [6] obtain a bound of $8\sqrt{\mathrm{VAR}_T^{\max} \ln(K)} + 10 \ln(K)$ for a modification of Hedge if the
cumulative empirical variance $\mathrm{VAR}_T^{\max}$ of the best expert is known. In applications it may be unrealistic to
assume that T or (especially) $L_T^*$ or $\mathrm{VAR}_T^{\max}$ is known beforehand, but at the cost of slightly worse
constants such problems may be circumvented using either the doubling trick (setting a budget on 
the unknown quantity and restarting the algorithm with a double budget when the budget is depleted) 
HO, or a variable learning rate that is adjusted each round JU [8). 

Bounding the regret in terms of $L_T^*$ or $\mathrm{VAR}_T^{\max}$ is based on the idea that worst-case performance is
not the only property of interest: such bounds give essentially the same guarantee in the worst case,
but a much better guarantee in a plausible favourable case (when $L_T^*$ or $\mathrm{VAR}_T^{\max}$ is small). In this
paper, we pursue the same goal for a different favourable case. To illustrate our approach, consider
the following simplistic example with two actions: let $0 < a < b < 1$ be such that $b - a > 2\epsilon$. Then
in odd rounds the first action gets loss $a + \epsilon$ and the second action gets loss $b - \epsilon$; in even rounds
the actions get losses $a - \epsilon$ and $b + \epsilon$, respectively. Informally, this seems like a very easy instance
of DTOL, because the cumulative losses of the actions diverge and it is easy to see from the losses
which action is the best one. In fact, the Follow-the-Leader strategy, which puts all probability mass
on the action with smallest cumulative loss, gives a regret of at most 1 in this case. The worst-case
bound $O(\sqrt{L_T^* \ln(K)})$ is very loose by comparison, and so is $O(\sqrt{\mathrm{VAR}_T^{\max} \ln(K)})$, which is of the
same order $\sqrt{T \ln(K)}$. On the other hand, for Follow-the-Leader one cannot guarantee sublinear
regret for worst-case instances. (For example, if one out of two actions yields losses 1/2, 0, 1, 0, 1, ...
and the other action yields losses 0, 1, 0, 1, 0, ..., its regret will be at least $T/2 - 1$.) To get the best
of both worlds, we introduce an adaptive version of Hedge, called AdaHedge, that automatically
adapts to the difficulty of the problem by varying the learning rate appropriately. As a result we
obtain constant regret for the simplistic example above and other 'easy' instances of DTOL, while
at the same time guaranteeing $O(\sqrt{L_T^* \ln(K)})$ regret in the worst case.
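The contrast between the two example sequences can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `follow_the_leader_regret`, the tie-breaking rule (lowest index wins) and the concrete values of `a`, `b`, `eps` and `T` are assumptions made for the example, not part of the analysis.

```python
def follow_the_leader_regret(losses):
    """Regret of Follow-the-Leader on a list of per-round loss vectors."""
    K = len(losses[0])
    cumulative = [0.0] * K
    agent_loss = 0.0
    for loss in losses:
        # Put all mass on the current leader (ties go to the lowest index).
        leader = min(range(K), key=lambda k: cumulative[k])
        agent_loss += loss[leader]
        cumulative = [c + l for c, l in zip(cumulative, loss)]
    return agent_loss - min(cumulative)

T = 1000
a, b, eps = 0.2, 0.7, 0.1  # satisfies b - a > 2 * eps

# Easy instance: the cumulative losses of the two actions diverge.
easy = [(a + eps, b - eps) if t % 2 == 1 else (a - eps, b + eps)
        for t in range(1, T + 1)]

# Hard instance: losses 1/2, 0, 1, 0, 1, ... versus 0, 1, 0, 1, 0, ...
hard = [(0.5, 0.0)] + [(0.0, 1.0) if t % 2 == 0 else (1.0, 0.0)
                       for t in range(2, T + 1)]

print("easy:", follow_the_leader_regret(easy))
print("hard:", follow_the_leader_regret(hard))
```

On the hard sequence the leader changes every round and is always the action that is about to incur loss 1, which is what drives the linear regret.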

It remains to characterise what we consider easy problems, which we will do in terms of the prob- 
abilities produced by Hedge. As explained below, these may be interpreted as a generalisation of 
Bayesian posterior probabilities. We measure the difficulty of the problem in terms of the speed at 
which the posterior probability of the best action converges to one. In the previous example, this 
happens at an exponential rate, whereas for worst-case instances the posterior probability of the best 
action does not converge to one at all. 

Outline In the next section we describe a new way of tuning the learning rate, and show that it
yields essentially optimal performance guarantees in the worst case. To construct the AdaHedge
algorithm, we then add the doubling trick to this idea in Section 3, and analyse its worst-case regret.
In Section 4 we show that AdaHedge in fact incurs much smaller regret on easy problems. We
compare AdaHedge to other instances of Hedge by means of a simulation study in Section 5. The
proof of our main technical lemma is postponed to Section 6, and open questions are discussed in
the concluding Section 7. Finally, longer proofs are only available as Additional Material in the full
version at arXiv.org.

2 Tuning the Learning Rate 

Setting Let the available actions be indexed by $k \in \{1, \ldots, K\}$. At the start of each round
$t = 1, 2, \ldots$ the agent A is to assign a probability $w_t^k$ to each action k by producing a vector
$w_t = (w_t^1, \ldots, w_t^K)$ with nonnegative components that sum up to 1. Then every action k incurs
a loss $\ell_t^k \in [0, 1]$, which we collect in the loss vector $\ell_t = (\ell_t^1, \ldots, \ell_t^K)$, and the loss of the agent
is $w_t \cdot \ell_t = \sum_{k=1}^K w_t^k \ell_t^k$. After T rounds action k has accumulated loss $L_T^k = \sum_{t=1}^T \ell_t^k$, and the
agent's regret is

$$R_A(T) = \sum_{t=1}^T w_t \cdot \ell_t - L_T^*,$$

where $L_T^* = \min_{1 \le k \le K} L_T^k$ is the cumulative loss of the best action.
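As a small worked instance of the definition, the sketch below computes the regret of an agent from its sequence of probability vectors and the loss vectors. The helper name `regret` and the toy data are illustrative assumptions.

```python
def regret(weights, losses):
    """Cumulative expected loss of the agent minus the loss of the best action."""
    K = len(losses[0])
    agent_loss = sum(
        sum(w * l for w, l in zip(w_t, loss_t))
        for w_t, loss_t in zip(weights, losses)
    )
    best_action_loss = min(sum(loss_t[k] for loss_t in losses) for k in range(K))
    return agent_loss - best_action_loss

# Two actions, three rounds, and an agent that always plays the uniform distribution.
toy_losses = [(0.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
toy_weights = [(0.5, 0.5)] * 3
toy_regret = regret(toy_weights, toy_losses)
```

Here the agent loses 0.5 in every round, the actions accumulate losses 1 and 2, and so the regret is 1.5 - 1 = 0.5.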
Hedge The Hedge algorithm chooses the weights $w_{t+1}^k$ proportional to $e^{-\eta L_t^k}$, where $\eta > 0$ is
the learning rate. As is well-known, these weights may essentially be interpreted as Bayesian posterior
probabilities on actions, relative to a uniform prior and pseudo-likelihoods
$P_t^k = e^{-\eta L_t^k} = \prod_{s=1}^t e^{-\eta \ell_s^k}$:

$$w_{t+1}^k = \frac{e^{-\eta L_t^k}}{\sum_{k'} e^{-\eta L_t^{k'}}} = \frac{\frac{1}{K} P_t^k}{B_t},$$

where

$$B_t = \sum_k \frac{1}{K} P_t^k = \sum_k \frac{1}{K} e^{-\eta L_t^k} \tag{1}$$

is a generalisation of the Bayesian marginal likelihood. And like the ordinary marginal likelihood,
$B_t$ factorizes into sequential per-round contributions:

$$B_t = \prod_{s=1}^t w_s \cdot e^{-\eta \ell_s}. \tag{2}$$

We will sometimes write $w_t(\eta)$ and $B_t(\eta)$ instead of $w_t$ and $B_t$ in order to emphasize the
dependence of these quantities on $\eta$.
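The sketch below computes Hedge weights from cumulative losses and checks numerically that the product of per-round contributions $w_s \cdot e^{-\eta \ell_s}$ agrees with the marginal likelihood taken as $B_t = \frac{1}{K} \sum_k e^{-\eta L_t^k}$ (uniform prior). The function names and the toy losses are assumptions for illustration; subtracting the smallest cumulative loss before exponentiating is a standard numerical precaution and does not change the weights.

```python
import math

def hedge_weights(cum_losses, eta):
    """Weights proportional to exp(-eta * cumulative loss), computed stably."""
    best = min(cum_losses)  # shifting by the minimum avoids underflow
    unnorm = [math.exp(-eta * (L - best)) for L in cum_losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def marginal_likelihood(cum_losses, eta):
    """B_t for a uniform prior over the K actions."""
    K = len(cum_losses)
    return sum(math.exp(-eta * L) for L in cum_losses) / K

toy_losses = [(0.2, 0.9, 0.5), (0.1, 0.4, 1.0), (0.3, 0.3, 0.0)]
eta = 0.7
cum = [0.0, 0.0, 0.0]
product = 1.0
for loss in toy_losses:
    w = hedge_weights(cum, eta)  # w_t only depends on losses before round t
    product *= sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))
    cum = [c + l for c, l in zip(cum, loss)]

print(product, marginal_likelihood(cum, eta))
```

The two printed numbers coincide up to rounding because each factor equals the ratio $B_s / B_{s-1}$, so the product telescopes.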



The Learning Rate and the Mixability Gap A key quantity in our and previous [4] analyses is
the gap between the per-round loss of the Hedge algorithm and the per-round contribution to the
negative logarithm of the "marginal likelihood" $B_T$, which we call the mixability gap:

$$\delta_t(\eta) = w_t(\eta) \cdot \ell_t - \Big( -\frac{1}{\eta} \ln\big(w_t(\eta) \cdot e^{-\eta \ell_t}\big) \Big).$$

In the setting of prediction with expert advice, the subtracted term coincides with the loss incurred
by the Aggregating Pseudo-Algorithm (APA) which, by allowing the losses of the actions to be
mixed with optimal efficiency, provides an idealised lower bound for the actual loss of any prediction
strategy [9]. The mixability gap measures how closely we approach this ideal. As the same
interpretation still holds in the more general DTOL setting of this paper, we can measure the difficulty
of the problem, and tune $\eta$, in terms of the cumulative mixability gap:

$$\Delta_T(\eta) = \sum_{t=1}^T \delta_t(\eta) = \sum_{t=1}^T w_t \cdot \ell_t + \frac{1}{\eta} \ln B_T(\eta).$$

We proceed to list some basic properties of the mixability gap. First, it is nonnegative and bounded
above by a constant that depends on $\eta$:

Lemma 1. For any t and $\eta > 0$ we have $0 \le \delta_t(\eta) \le \eta/8$.

Proof. The lower bound follows by applying Jensen's inequality to the concave function ln, the
upper bound from Hoeffding's bound on the cumulant generating function [4, Lemma A.1]. □
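A quick numerical sanity check of Lemma 1 can be written as follows. It is a sketch, not a proof: the helper names, the random sampling scheme and the small floating-point tolerance are all assumptions of the example.

```python
import math
import random

def mixability_gap(w, loss, eta):
    """delta_t = w . loss - ( -(1/eta) * ln(w . exp(-eta * loss)) )."""
    hedge_loss = sum(wk * lk for wk, lk in zip(w, loss))
    mix_loss = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))) / eta
    return hedge_loss - mix_loss

def random_distribution(K, rng):
    """A random probability vector with strictly positive components."""
    raw = [rng.random() + 1e-12 for _ in range(K)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(0)
worst = 0.0  # largest observed ratio delta_t / (eta / 8)
for _ in range(10_000):
    K = rng.randint(2, 6)
    eta = rng.uniform(0.01, 5.0)
    w = random_distribution(K, rng)
    loss = [rng.random() for _ in range(K)]
    gap = mixability_gap(w, loss, eta)
    assert -1e-9 <= gap <= eta / 8 + 1e-9  # Lemma 1, up to rounding
    worst = max(worst, gap / (eta / 8))

print("largest observed delta / (eta/8):", worst)
```

The upper bound is approached when the weight is split evenly between an action with loss 0 and an action with loss 1 and $\eta$ is small.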

Further, the cumulative mixability gap $\Delta_T(\eta)$ can be related to $L_T^*$ via the following upper bound,
proved in the Additional Material:

Lemma 2. For any T and $\eta \in (0, 1]$ we have

$$\Delta_T(\eta) \le \frac{\eta L_T^* + \ln(K)}{e - 1}.$$

This relationship will make it possible to provide worst-case guarantees similar to what is possible
when $\eta$ is tuned in terms of $L_T^*$. However, for easy instances of DTOL this inequality is very loose,
in which case we can prove substantially better regret bounds. We could now proceed by optimizing
the learning rate $\eta$ given the rather awkward assumption that $\Delta_T(\eta)$ is bounded by a known constant
b for all $\eta$, which would be the natural counterpart to an analysis that optimizes $\eta$ when a bound on
$L_T^*$ is known. However, as $\Delta_T(\eta)$ varies with $\eta$ and is unknown a priori anyway, it makes more
sense to turn the analysis on its head and start by fixing $\eta$. We can then simply run the Hedge
algorithm until the smallest T such that $\Delta_T(\eta)$ exceeds an appropriate budget $b(\eta)$, which we set to

$$b(\eta) = \Big( \frac{1}{\eta} + \frac{1}{e - 1} \Big) \ln(K). \tag{3}$$
When at some point the budget is depleted, i.e. $\Delta_T(\eta) \ge b(\eta)$, Lemma 2 implies that

$$\eta \ge \sqrt{(e - 1) \ln(K) / L_T^*}, \tag{4}$$

so that, up to a constant factor, the learning rate used by AdaHedge is at least as large as the learning
rates proportional to $\sqrt{\ln(K) / L_T^*}$ that are used in the literature. On the other hand, it is not too
large, because we can still provide a bound of order $O(\sqrt{L_T^* \ln(K)})$ on the worst-case regret:

Theorem 3. Suppose the agent runs Hedge with learning rate $\eta \in (0, 1]$, and after T rounds has
just used up the budget (3), i.e. $b(\eta) \le \Delta_T(\eta) < b(\eta) + \eta/8$. Then its regret is bounded by

$$R_{\mathrm{Hedge}(\eta)}(T) < \sqrt{\frac{4}{e - 1} L_T^* \ln(K)} + \frac{1}{e - 1} \ln(K) + \frac{1}{8}.$$

Proof. The cumulative loss of Hedge is bounded by

$$\sum_{t=1}^T w_t \cdot \ell_t = \Delta_T(\eta) - \frac{1}{\eta} \ln B_T < b(\eta) + \eta/8 - \frac{1}{\eta} \ln B_T \le \frac{1}{e - 1} \ln(K) + \frac{1}{8} + \frac{2}{\eta} \ln(K) + L_T^*, \tag{5}$$

where we have used the bound $B_T \ge \frac{1}{K} e^{-\eta L_T^*}$. Plugging in (4) completes the proof. □
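For readers who want the intermediate algebra, the two steps used above can be spelled out as follows. This is a sketch that assumes the budget $b(\eta) = (1/\eta + 1/(e-1)) \ln K$ and the bound of Lemma 2 in the form $\Delta_T(\eta) \le (\eta L_T^* + \ln K)/(e - 1)$. The first line derives the lower bound on the learning rate from a depleted budget; the second line shows how that lower bound turns the $\frac{2}{\eta} \ln K$ term of the cumulative-loss bound into the leading term of the regret bound.

```latex
\begin{align*}
\Big(\tfrac{1}{\eta} + \tfrac{1}{e-1}\Big)\ln K = b(\eta) \le \Delta_T(\eta)
  \le \frac{\eta L_T^* + \ln K}{e-1}
  &\;\Longrightarrow\; \frac{\ln K}{\eta} \le \frac{\eta L_T^*}{e-1}
  \;\Longrightarrow\; \eta \ge \sqrt{\frac{(e-1)\ln K}{L_T^*}}, \\
\frac{2}{\eta}\ln K \le 2\ln K \sqrt{\frac{L_T^*}{(e-1)\ln K}}
  &= \sqrt{\frac{4}{e-1}\, L_T^* \ln K}.
\end{align*}
```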



3 The AdaHedge Algorithm 

We now introduce the AdaHedge algorithm by adding the doubling trick to the analysis of the
previous section. The doubling trick divides the rounds in segments $i = 1, 2, \ldots$, and on each
segment restarts Hedge with a different learning rate $\eta_i$. For AdaHedge we set $\eta_1 = 1$ initially, and
scale down the learning rate by a factor of $\phi > 1$ for every new segment, such that $\eta_i = \phi^{1-i}$. We
monitor $\Delta_t(\eta_i)$, measured only on the losses in the i-th segment, and when it exceeds its budget
$b_i = b(\eta_i)$ a new segment is started. The factor $\phi$ is a parameter of the algorithm. Theorem 5 below
suggests setting its value to the golden ratio $\phi = (1 + \sqrt{5})/2 \approx 1.62$ or simply to $\phi = 2$.



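Algorithm 1 AdaHedge($\phi$)    ▷ Requires $\phi > 1$

  $\eta \leftarrow \phi$
  for $t = 1, 2, \ldots$ do
    if $t = 1$ or $\Delta \ge b$ then
      ▷ Start a new segment
      $\eta \leftarrow \eta / \phi$;  $b \leftarrow (\frac{1}{\eta} + \frac{1}{e - 1}) \ln(K)$
      $\Delta \leftarrow 0$;  $w = (w^1, \ldots, w^K) \leftarrow (\frac{1}{K}, \ldots, \frac{1}{K})$
    end if
    ▷ Make a decision
    Output probabilities w for round t
    Actions receive losses $\ell_t$
    ▷ Prepare for the next round
    $\Delta \leftarrow \Delta + w \cdot \ell_t + \frac{1}{\eta} \ln(w \cdot e^{-\eta \ell_t})$
    $w \leftarrow (w^1 e^{-\eta \ell_t^1}, \ldots, w^K e^{-\eta \ell_t^K}) / (w \cdot e^{-\eta \ell_t})$
  end for
end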
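Algorithm 1 can be transcribed into Python almost line by line. The version below is a minimal sketch under stated assumptions: the function name `adahedge`, its return value (regret and number of segments) and the offline interface taking the whole loss sequence at once are choices made for the example, and it assumes $K \ge 2$ so that the budget is strictly positive.

```python
import math

def adahedge(losses, phi=2.0):
    """Run AdaHedge on a list of loss vectors; return (regret, segments started)."""
    if phi <= 1:
        raise ValueError("phi must be larger than 1")
    K = len(losses[0])
    eta = phi                 # becomes 1 when the first segment starts
    delta = budget = 0.0      # cumulative mixability gap and budget of the segment
    w = []
    total_loss = 0.0
    segments = 0
    for t, loss in enumerate(losses, start=1):
        if t == 1 or delta >= budget:
            # Start a new segment: lower the learning rate, reset gap and weights.
            eta /= phi
            budget = (1.0 / eta + 1.0 / (math.e - 1.0)) * math.log(K)
            delta = 0.0
            w = [1.0 / K] * K
            segments += 1
        # Make a decision: the agent suffers the expected loss under w.
        hedge_loss = sum(wk * lk for wk, lk in zip(w, loss))
        total_loss += hedge_loss
        # Prepare for the next round: update the gap, then the weights.
        mix = sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))
        delta += hedge_loss + math.log(mix) / eta
        w = [wk * math.exp(-eta * lk) / mix for wk, lk in zip(w, loss)]
    best = min(sum(loss[k] for loss in losses) for k in range(K))
    return total_loss - best, segments

# The simplistic two-action example from the introduction (illustrative values).
T, a, b, eps = 1000, 0.2, 0.7, 0.1
easy = [(a + eps, b - eps) if t % 2 == 1 else (a - eps, b + eps)
        for t in range(1, T + 1)]
easy_regret, easy_segments = adahedge(easy, phi=2.0)
print(easy_regret, easy_segments)
```

Because losses lie in [0, 1] and the learning rate never exceeds 1, the normaliser `mix` is at least $e^{-1}$, so the logarithm and the division are numerically safe.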



The regret of AdaHedge is determined by the number of segments it creates: the fewer segments 
there are, the smaller the regret. 

Lemma 4. Suppose that after T rounds, the AdaHedge algorithm has started m new segments.
Then its regret is bounded by

$$R_{\mathrm{AdaHedge}}(T) \le 2 \ln(K) \Big( \frac{\phi^m - 1}{\phi - 1} \Big) + m \Big( \frac{1}{e - 1} \ln(K) + \frac{1}{8} \Big).$$

Proof. The regret per segment is bounded as in (5). Summing over all m segments, and plugging in
$\sum_{i=1}^m 1/\eta_i = \sum_{i=0}^{m-1} \phi^i = (\phi^m - 1)/(\phi - 1)$ gives the required inequality. □
Using (4), one can obtain an upper bound on the number of segments that leads to the following
guarantee for AdaHedge:

Theorem 5. Suppose the agent runs AdaHedge for T rounds. Then its regret is bounded by

$$R_{\mathrm{AdaHedge}}(T) \le \frac{\phi \sqrt{\phi^2 - 1}}{\phi - 1} \sqrt{\frac{4}{e - 1} L_T^* \ln(K)} + O(\ln(L_T^* + 2) \ln(K)).$$

For details see the proof in the Additional Material. The value for $\phi$ that minimizes the leading
factor is the golden ratio $\phi = (1 + \sqrt{5})/2$, for which $\phi \sqrt{\phi^2 - 1}/(\phi - 1) \approx 3.33$, but simply taking
$\phi = 2$ leads to a very similar factor of $\phi \sqrt{\phi^2 - 1}/(\phi - 1) \approx 3.46$.
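The two constants quoted above are easy to check numerically. The helper name `leading_factor` and the grid used to probe the minimum are assumptions of this sketch.

```python
import math

def leading_factor(phi):
    """The factor phi * sqrt(phi^2 - 1) / (phi - 1) in front of the main term."""
    return phi * math.sqrt(phi ** 2 - 1) / (phi - 1)

golden = (1 + math.sqrt(5)) / 2
print(round(leading_factor(golden), 2), round(leading_factor(2.0), 2))

# Probe a grid of values phi in (1, 4] to see where the factor is smallest.
grid = [1 + i / 1000 for i in range(1, 3001)]
best_phi = min(grid, key=leading_factor)
```

Setting the derivative of the logarithm of the factor to zero gives $\phi^2 - 1 = \phi$, which is the defining equation of the golden ratio and agrees with the grid search.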

4 Easy Instances 

While the previous sections reassure us that AdaHedge performs well for the worst possible sequence
of losses, we are also interested in its behaviour when the losses are not maximally antagonistic.
We will characterise such sequences in terms of convergence of the Hedge posterior
probability of the best action:

$$w_t^*(\eta) = \max_{1 \le k \le K} w_t^k(\eta).$$

(Recall that $w_t^k$ is proportional to $e^{-\eta L_{t-1}^k}$, so $w_t^*$ corresponds to the posterior probability of the
action with smallest cumulative loss.) Technically, this is expressed by the following refinement of
Lemma 1, which is proved in Section 6.

Lemma 6. For any t and $\eta \in (0, 1]$ we have $\delta_t(\eta) \le (e - 2) \eta (1 - w_t^*(\eta))$.

This lemma, which may be of independent interest, is a variation on Hoeffding's bound on the
cumulant generating function. While Lemma 1 leads to a bound on $\Delta_T(\eta)$ that grows linearly
in T, Lemma 6 shows that $\Delta_T(\eta)$ may grow much slower. In fact, if the posterior probabilities $w_t^*$
converge to 1 sufficiently quickly, then $\Delta_T(\eta)$ is bounded, as shown by the following lemma. Recall
that $L_T^* = \min_{1 \le k \le K} L_T^k$.

Lemma 7. Let $\alpha$ and $\beta$ be positive constants, and let $\tau \in \mathbb{Z}^+$. Suppose that for $t = \tau, \tau + 1, \ldots, T$
there exists a single action $k^*$ that achieves minimal cumulative loss $L_t^{k^*} = L_t^*$, and for $k \ne k^*$ the
cumulative losses diverge as $L_t^k - L_t^* \ge \alpha t^\beta$. Then for all $\eta > 0$

$$\sum_{t=\tau}^T \big(1 - w_{t+1}^*(\eta)\big) \le C_K \eta^{-1/\beta},$$

where $C_K = (K - 1) \alpha^{-1/\beta} \Gamma(1 + \frac{1}{\beta})$ is a constant that does not depend on $\eta$, $\tau$ or T.
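To get a feeling for how tight Lemma 7 is, one can evaluate both sides in the boundary case where every suboptimal action trails the best one by exactly $\alpha t^\beta$. The sketch below does this; the function names and the parameter values are assumptions chosen for illustration.

```python
import math

def lemma7_sum(K, alpha, beta, eta, T, tau=1):
    """Sum of 1 - w*_{t+1}(eta) when all K - 1 other actions trail by alpha * t**beta."""
    total = 0.0
    for t in range(tau, T + 1):
        gap = alpha * t ** beta
        w_star = 1.0 / (1.0 + (K - 1) * math.exp(-eta * gap))
        total += 1.0 - w_star
    return total

def lemma7_bound(K, alpha, beta, eta):
    """C_K * eta**(-1/beta) with C_K = (K - 1) * alpha**(-1/beta) * Gamma(1 + 1/beta)."""
    c_k = (K - 1) * alpha ** (-1.0 / beta) * math.gamma(1.0 + 1.0 / beta)
    return c_k * eta ** (-1.0 / beta)

for K, alpha, beta, eta in [(4, 0.1, 1.0, 0.5), (2, 0.3, 1.0, 1.0), (3, 0.5, 0.75, 0.25)]:
    print(K, alpha, beta, eta,
          lemma7_sum(K, alpha, beta, eta, T=10_000),
          lemma7_bound(K, alpha, beta, eta))
```

The left-hand side stops growing once the posterior has concentrated, which is why the bound does not depend on T.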

The lemma is proved in the Additional Material. Together with Lemmas 1 and 6 it gives an upper
bound on $\Delta_T(\eta)$, which may be used to bound the number of segments started by AdaHedge. This
leads to the following result, whose proof is also delegated to the Additional Material.

Let $s(m)$ denote the round in which AdaHedge starts its m-th segment, and let
$L_r^k(m) = L_{s(m)+r-1}^k - L_{s(m)-1}^k$ denote the cumulative loss of action k over the first r rounds of that segment.

Lemma 8. Let $\alpha > 0$ and $\beta > 1/2$ be constants, and let $C_K$ be as in Lemma 7. Suppose there
exists a segment $m^* \in \mathbb{Z}^+$ started by AdaHedge, such that
$\tau := \lfloor 8 \ln(K) \phi^{(m^* - 1)(2 - 1/\beta)} - 8(e - 2) C_K + 1 \rfloor \ge 1$
and for some action $k^*$ the cumulative losses in segment $m^*$ diverge as

$$L_r^k(m^*) - L_r^{k^*}(m^*) \ge \alpha r^\beta \quad \text{for all } r \ge \tau \text{ and } k \ne k^*. \tag{6}$$

Then AdaHedge starts at most $m^*$ segments, and hence by Lemma 4 its regret is bounded by a
constant:

$$R_{\mathrm{AdaHedge}}(T) = O(1).$$

In the simplistic example from the introduction, we may take $\alpha = b - a - 2\epsilon$ and $\beta = 1$, such that
(6) is satisfied for any $\tau \ge 1$. Taking $m^*$ large enough to ensure that $\tau \ge 1$, we find that AdaHedge
never starts more than $m^* = 1 + \lceil \log_\phi ( \frac{e - 2}{\alpha \ln(2)} + \frac{1}{8 \ln(2)} ) \rceil$ segments. Let us also give an example of
a probabilistic setting in which Lemma 8 applies:
Theorem 9. Let $\alpha > 0$ and $\delta \in (0, 1]$ be constants, and let $k^*$ be a fixed action. Suppose the loss
vectors $\ell_t$ are independent random variables such that the expected differences in loss satisfy

$$\min_{k \ne k^*} \mathrm{E}[\ell_t^k - \ell_t^{k^*}] \ge 2\alpha \quad \text{for all } t \in \mathbb{Z}^+. \tag{7}$$

Then, with probability at least $1 - \delta$, AdaHedge starts at most

$$m^* = 1 + \Big\lceil \log_\phi \Big( \frac{(K - 1)(e - 2)}{\alpha \ln(K)} + \frac{\ln(2K/(\alpha^2 \delta))}{4 \alpha^2 \ln(K)} + \frac{1}{8 \ln(K)} \Big) \Big\rceil \tag{8}$$

segments and consequently its regret is bounded by a constant:

$$R_{\mathrm{AdaHedge}}(T) = O(K + \log(1/\delta)).$$

This shows that the probabilistic setting of the theorem is much easier than the worst case, for which
only a bound on the regret of order $O(\sqrt{T \ln(K)})$ is possible, and that AdaHedge automatically
adapts to this easier setting. The proof of Theorem 9 is in the Additional Material. It verifies that the
conditions of Lemma 8 hold with sufficient probability for $\beta = 1$, and $\alpha$ and $m^*$ as in the theorem.
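The segment bound of Theorem 9 is explicit enough to evaluate. The sketch below assumes the bound reads $m^* = 1 + \lceil \log_\phi ( \frac{(K-1)(e-2)}{\alpha \ln K} + \frac{\ln(2K/(\alpha^2 \delta))}{4 \alpha^2 \ln K} + \frac{1}{8 \ln K} ) \rceil$; the function name `max_segments` and the example parameters (four actions whose expected losses differ by at least 0.05, so $\alpha = 0.025$) are illustrative assumptions.

```python
import math

def max_segments(K, alpha, delta, phi=2.0):
    """Number of segments m* that AdaHedge starts at most, w.p. at least 1 - delta."""
    log_k = math.log(K)
    inner = ((K - 1) * (math.e - 2) / (alpha * log_k)
             + math.log(2 * K / (alpha ** 2 * delta)) / (4 * alpha ** 2 * log_k)
             + 1 / (8 * log_k))
    return 1 + math.ceil(math.log(inner, phi))

m_star = max_segments(K=4, alpha=0.025, delta=0.05, phi=2.0)
print(m_star)
```

The bound grows only logarithmically in $1/\delta$ and $1/\alpha$, and it does not involve T at all, which is the sense in which the regret is constant.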



5 Experiments 

We compare AdaHedge to other hedging algorithms in two experiments involving simulated losses. 



5.1 Hedging Algorithms 

Follow-the-Leader. This algorithm is included because it is simple and very effective if the losses
are not antagonistic, although as mentioned in the introduction its regret is linear in the worst case.

Hedge with fixed learning rate. We also include Hedge with a fixed learning rate

$$\eta = \sqrt{2 \ln(K) / L_T^*}, \tag{9}$$

which achieves the regret bound $\sqrt{2 \ln(K) L_T^*} + \ln(K)$.¹ Since $\eta$ is a function of $L_T^*$, the agent
needs to use post-hoc knowledge to use this strategy.

Hedge with doubling trick. The common way to apply the doubling trick to $L_T^*$ is to set a budget on
$L_T^*$ and multiply it by some constant $\phi'$ at the start of each new segment, after which $\eta$ is optimized
for the new budget [4, 7]. Instead, we proceed the other way around and with each new segment
first divide $\eta$ by $\phi = 2$ and then calculate the new budget such that (9) holds when $\Delta_t(\eta)$ reaches
the budget. This way we keep the same invariant ($\eta$ is never larger than the right-hand side of (9),
with equality when the budget is depleted), and the frequency of doubling remains logarithmic in
$L_T^*$ with a constant determined by $\phi$, so both approaches are equally valid. However, controlling the
sequence of values of $\eta$ allows for easier comparison to AdaHedge.

AdaHedge (Algorithm 1). As in the previous algorithm, we set $\phi = 2$. Because of how we set up
the doubling, both algorithms now use the same sequence of learning rates 1, 1/2, 1/4, ...; the only
difference is when they decide to start a new segment.

Hedge with variable learning rate. Rather than using the doubling trick, this algorithm, described
in [8], changes the learning rate each round as a function of $L_t^*$. This way there is no need to relearn
the weights of the actions in each block, which leads to a better worst-case bound and potentially
better performance in practice. Its behaviour on easy problems, which is our current interest, has
not been studied.

5.2 Generating the Losses 

In both experiments we choose losses in {0, 1}. The experiments are set up as follows. 

1 Cesa-Bianchi and Lugosi use $\eta = \ln(1 + \sqrt{2 \ln K / L_T^*})$ [4], but the same bound can be obtained for the
simplified expression we use.

Figure 1: Simulation results



I.I.D. losses. In the first experiment, all T = 10 000 losses for all K = 4 actions are independent,
with distribution depending only on the action: the probabilities of incurring loss 1 are 0.35, 0.4,
0.45 and 0.5, respectively. The results are then averaged over 50 repetitions of the experiment.

Correlated losses. In the second experiment, the T = 10 000 loss vectors are still independent,
but no longer identically distributed. In addition there are dependencies within the loss vectors $\ell_t$,
between the losses for the K = 2 available actions: each round is hard with probability 0.3, and
easy otherwise. If round t is hard, then action 1 yields loss 1 with probability $1 - 0.01/t$ and action
2 yields loss 1 with probability $1 - 0.02/t$. If the round is easy, then the probabilities are flipped and
the actions yield loss 0 with the same probabilities. The results are averaged over 200 repetitions.
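The two loss-generating processes can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the text: the function names, the use of `random.Random` for reproducibility, and in particular the choice to draw the two actions' losses independently of each other once the round type (hard or easy) is fixed.

```python
import random

def iid_losses(T=10_000, probs=(0.35, 0.4, 0.45, 0.5), rng=None):
    """First experiment: independent Bernoulli losses, one probability per action."""
    rng = rng or random.Random()
    return [tuple(1 if rng.random() < p else 0 for p in probs) for _ in range(T)]

def correlated_losses(T=10_000, rng=None):
    """Second experiment: two actions whose losses depend on a shared round type."""
    rng = rng or random.Random()
    losses = []
    for t in range(1, T + 1):
        hard = rng.random() < 0.3
        q1, q2 = 1 - 0.01 / t, 1 - 0.02 / t
        hit1 = rng.random() < q1  # assumed independent of hit2 given the round type
        hit2 = rng.random() < q2
        if hard:
            # Hard round: loss 1 with probability q1 (action 1) and q2 (action 2).
            losses.append((int(hit1), int(hit2)))
        else:
            # Easy round: loss 0 with those same probabilities.
            losses.append((1 - int(hit1), 1 - int(hit2)))
    return losses

iid_sample = iid_losses(T=100, rng=random.Random(0))
correlated_sample = correlated_losses(T=100, rng=random.Random(0))
```

Under this reading the expected loss of action 2 exceeds that of action 1 by $0.004/t$ in round t, so the expected cumulative difference grows only logarithmically.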

5.3 Discussion and Results 

Figure 1 shows the results of the experiments above. We plot the regret (averaged over repetitions
of the experiment) as a function of the number of rounds, for each of the considered algorithms.

I.I.D. Losses. In the first considered regime, the accumulated losses for each action diverge linearly
with high probability, so that the regret of Follow-the-Leader is bounded. Based on Theorem 9
we expect AdaHedge to incur bounded regret also; this is confirmed in Figure 1(a). Hedge with a
fixed learning rate shows much larger regret. This happens because the learning rate, while it
optimizes the worst-case bound, is much too small for this easy regime. In fact, if we were to include
more rounds, the learning rate would be set to an even smaller value, clearly showing the need to
determine the learning rate adaptively. The doubling trick provides one way to adapt the learning
rate; indeed, we observe that the regret of Hedge with the doubling trick is initially smaller than the
regret of Hedge with fixed learning rate. However, unlike AdaHedge, the algorithm never detects
that its current value of $\eta$ is working well; instead it keeps exhausting its budget, which leads to a
sequence of clearly visible bumps in its regret. Finally, it appears that the Hedge algorithm with
variable learning rate also achieves bounded regret. This is surprising, as the existing theory for
this algorithm only considers its worst-case behaviour, and the algorithm was not designed to do
specifically well in easy regimes.

Correlated Losses. In the second simulation we investigate the case where the mean cumulative
losses of the two actions are extremely close: within $O(\log t)$ of one another. If the losses of the actions
were independent, such a small difference would be dwarfed by random fluctuations in the cumulative
losses, which would be of order $O(\sqrt{t})$. Thus the two actions can only be distinguished because
we have made their losses dependent. Depending on the application, this may actually be a more natural
scenario than complete independence as in the first simulation; for example, we can think of the
losses as mistakes of two binary classifiers, say, two naive Bayes classifiers with different smoothing
parameters. In such a scenario, losses will be dependent, and the difference in cumulative loss
will be much smaller than $O(\sqrt{t})$. In the previous experiment, the posterior weights of the actions
converged relatively quickly for a large range of learning rates, so that the exact value of the learning
rate was most important at the start (e.g., from 3000 rounds onward Hedge with fixed learning rate
does not incur much additional regret any more). In this second setting, using a high learning rate
remains important throughout. This explains why in this case Hedge with variable learning rate can
no longer keep up with Follow-the-Leader. The results for AdaHedge are also interesting: although
Theorem 9 does not apply in this case, we may still hope that $\Delta_t(\eta)$ grows slowly enough that the
algorithm does not start too many segments. This turns out to be the case: over the 200 repetitions
of the experiment, AdaHedge started only 2.265 segments on average, which explains its excellent
performance in this simulation.



6 Proof of Lemma 6

Our main technical tool is Lemma 6. Its proof requires the following intermediate result:

Lemma 10. For any $\eta > 0$ and any time t, the function $f(\ell_t) = \ln(w_t \cdot e^{-\eta \ell_t})$ is convex.

This may be proved by observing that f is the convex conjugate of the Kullback-Leibler divergence.
An alternative proof based on log-convexity is provided in the Additional Material.



Proof of Lemma 6. We need to bound $\delta_t = w_t(\eta) \cdot \ell_t + \frac{1}{\eta} \ln(w_t(\eta) \cdot e^{-\eta \ell_t})$, which is a convex
function of $\ell_t$ by Lemma 10. As a consequence, its maximum is achieved when $\ell_t$ lies on the
boundary of its domain, such that the losses $\ell_t^k$ are either 0 or 1 for all k, and in the remainder of the
proof we will assume (without loss of generality) that this is the case. Now let $\alpha_t = w_t \cdot \ell_t$ be the
posterior probability of the actions with loss 1. Then

$$\delta_t = \alpha_t + \frac{1}{\eta} \ln\big((1 - \alpha_t) + \alpha_t e^{-\eta}\big) = \alpha_t + \frac{1}{\eta} \ln\big(1 + \alpha_t (e^{-\eta} - 1)\big).$$

Using $\ln x \le x - 1$ and $e^{-\eta} \le 1 - \eta + \frac{1}{2} \eta^2$, we get $\delta_t \le \frac{1}{2} \alpha_t \eta$, which is tight for $\alpha_t$ near 0. For $\alpha_t$
near 1, rewrite

$$\delta_t = \alpha_t - 1 + \frac{1}{\eta} \ln\big(e^{\eta} (1 - \alpha_t) + \alpha_t\big)$$

and use $\ln x \le x - 1$ and $e^{\eta} \le 1 + \eta + (e - 2) \eta^2$ for $\eta \le 1$ to obtain $\delta_t \le (e - 2)(1 - \alpha_t) \eta$.
Combining the bounds, we find

$$\delta_t \le (e - 2) \eta \min\{\alpha_t, 1 - \alpha_t\}.$$

Now, let $k^*$ be an action such that $w_t^{k^*} = w_t^*$. Then $\ell_t^{k^*} = 0$ implies $\alpha_t \le 1 - w_t^*$. On the other
hand, if $\ell_t^{k^*} = 1$, then $\alpha_t \ge w_t^*$ so $1 - \alpha_t \le 1 - w_t^*$. Hence, in both cases $\min\{\alpha_t, 1 - \alpha_t\} \le 1 - w_t^*$,
which completes the proof. □
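The key inequality for binary losses, $\delta_t \le (e - 2) \eta \min\{\alpha_t, 1 - \alpha_t\}$, can be checked on a grid. The sketch below is a numerical illustration only; the function names, the grid resolution and the rounding tolerance are assumptions.

```python
import math

def binary_gap(alpha_t, eta):
    """Mixability gap when a posterior mass alpha_t sits on actions with loss 1."""
    return alpha_t + math.log(1.0 + alpha_t * (math.exp(-eta) - 1.0)) / eta

def lemma6_binary_bound(alpha_t, eta):
    """(e - 2) * eta * min(alpha_t, 1 - alpha_t)."""
    return (math.e - 2) * eta * min(alpha_t, 1.0 - alpha_t)

violations = 0
tightest = 0.0  # largest observed ratio gap / bound
for i in range(0, 101):
    for j in range(1, 101):
        alpha_t, eta = i / 100, j / 100  # alpha_t in [0, 1], eta in (0, 1]
        gap = binary_gap(alpha_t, eta)
        bound = lemma6_binary_bound(alpha_t, eta)
        if gap > bound + 1e-12:
            violations += 1
        if bound > 0:
            tightest = max(tightest, gap / bound)

print("violations:", violations, "largest gap/bound ratio:", tightest)
```

The ratio gets close to 1 for $\eta = 1$ and $\alpha_t$ near 1, where both elementary inequalities used in the proof are nearly tight.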



7 Conclusion and Future Work 



We have presented a new algorithm, AdaHedge, that adapts to the difficulty of the DTOL learning
problem. This difficulty was characterised in terms of convergence of the posterior probability of the
best action. For hard instances of DTOL, for which the posterior does not converge, it was shown
that the regret of AdaHedge is of the optimal order $O(\sqrt{L_T^* \ln(K)})$; for easy instances, for which
the posterior converges sufficiently fast, the regret was bounded by a constant. This behaviour was
confirmed in a simulation study, where the algorithm outperformed existing versions of Hedge.

A surprising observation in the experiments was the good performance of Hedge with a variable
learning rate on some easy instances. It would be interesting to obtain matching theoretical guarantees,
like those presented here for AdaHedge. A starting point might be to consider how fast the
posterior probability of the best action converges to one, and plug that into Lemma 6.
A Additional Material

Proof of Lemma 2. Lemma A.3 in [4] gives the bound $\ln \mathrm{E}[e^{sX}] \le (e^s - 1) \mathrm{E}[X]$ for any random
variable X taking values in [0, 1] and any $s \in \mathbb{R}$. Defining $X = \ell_t^k$ with distribution $w_t$, setting
$s = -\eta$ and dividing by the negative factor $(e^{-\eta} - 1)$, we obtain

$$\frac{1}{e^{-\eta} - 1} \ln(w_t \cdot e^{-\eta \ell_t}) \ge w_t \cdot \ell_t.$$

It follows that

$$\Delta_T(\eta) = \sum_{t=1}^T \Big( w_t \cdot \ell_t + \frac{1}{\eta} \ln(w_t \cdot e^{-\eta \ell_t}) \Big) \le -f(\eta) \sum_{t=1}^T \ln(w_t \cdot e^{-\eta \ell_t}) = -f(\eta) \ln B_T,$$

where $f(\eta) = 1/(1 - e^{-\eta}) - 1/\eta$ is a nonnegative, increasing function and $B_T$ is the marginal
likelihood (see (1) and (2)). The lemma now follows by bounding $B_T$ from below by $\frac{1}{K} e^{-\eta L_T^*}$ and
using $f(\eta) \le f(1) = 1/(e - 1)$. □



Proof of Theorem 5. In order to apply Lemma 4 we will need to bound the number of segments m.
To this end, let $L^*(i)$ denote the cumulative loss of the best action on the i-th segment. That is, if
the i-th segment spans rounds $t_1, \ldots, t_2$, then $L^*(i) = \min_k \sum_{t=t_1}^{t_2} \ell_t^k$. If m = 1, then the theorem
is true by Lemma 4, so suppose that $m \ge 2$. Then we know that the budgets for the first m - 1
segments have been depleted, so that for these segments (4) applies, giving:

$$\frac{\phi^{2m-2} - 1}{\phi^2 - 1} = \sum_{i=1}^{m-1} \phi^{2i-2} = \sum_{i=1}^{m-1} \frac{1}{\eta_i^2} \le \sum_{i=1}^{m-1} \frac{L^*(i)}{(e - 1) \ln(K)} \le \frac{L_T^*}{(e - 1) \ln(K)}.$$

Solving for m, we find

$$m \le \log_\phi \Big( \sqrt{\frac{(\phi^2 - 1) L_T^*}{(e - 1) \ln(K)} + 1} \Big) + 1.$$

Substitution in Lemma 4 gives

$$R_{\mathrm{AdaHedge}}(T) \le 2 \ln(K) \Big( \frac{\phi^m - 1}{\phi - 1} \Big) + m \Big( \frac{1}{e - 1} \ln(K) + \frac{1}{8} \Big)$$
$$\le \frac{2 \ln(K)}{\phi - 1} \Big( \phi \sqrt{\frac{(\phi^2 - 1) L_T^*}{(e - 1) \ln(K)} + 1} - 1 \Big) + O(\ln(L_T^* + 2) \ln(K))$$
$$\le \frac{2 \phi \ln(K)}{\phi - 1} \sqrt{\frac{(\phi^2 - 1) L_T^*}{(e - 1) \ln(K)}} + O(\ln(L_T^* + 2) \ln(K)),$$

where the last step uses $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$. Rearranging yields the theorem. □

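Proof of Lemma 7. For t between $\tau$ and T we have

$$\frac{1}{w_{t+1}^*(\eta)} = \sum_{k=1}^K e^{-\eta (L_t^k - L_t^*)} \le 1 + (K - 1) e^{-\alpha \eta t^\beta},$$

which implies

$$\sum_{t=\tau}^T \big(1 - w_{t+1}^*(\eta)\big) \le \sum_{t=\tau}^T \Big( 1 - \frac{1}{1 + (K - 1) e^{-\alpha \eta t^\beta}} \Big)$$
$$= (K - 1) \sum_{t=\tau}^T \frac{1}{e^{\alpha \eta t^\beta} + K - 1} \le (K - 1) \sum_{t=\tau}^\infty e^{-\alpha \eta t^\beta}$$
$$\le (K - 1) \int_0^\infty e^{-\alpha \eta t^\beta} \, dt = (K - 1) (\alpha \eta)^{-1/\beta} \Gamma(1 + 1/\beta),$$

where the integral can be evaluated using the variable substitution $u = \alpha \eta t^\beta$ and the fact that
$z \Gamma(z) = \Gamma(1 + z)$. □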
Proof of Lemma 8. Let

$$\Delta_T(m^*, \eta) = \sum_{t=s(m^*)}^T \delta_t(\eta)$$

be a version of $\Delta_T(\eta)$ measured only on the rounds since the start of the $m^*$-th segment. We bound
the first $\tau - 1$ terms in this sum using Lemma 1 and the remaining terms using Lemmas 6 and 7,²
which gives

$$\Delta_T(m^*, \eta) = \sum_{t=s(m^*)}^T \delta_t(\eta) \le \frac{\eta}{8} (\tau - 1) + (e - 2) C_K \eta^{1 - 1/\beta} \le \frac{\eta^{1 - 1/\beta}}{8} \big( \tau - 1 + 8(e - 2) C_K \big)$$

for any $\eta \le 1$.

We will argue that the budget in segment $m^*$ is never depleted: $\Delta_T(m^*, \eta_{m^*}) < b(\eta_{m^*}) =
\ln(K)/\eta_{m^*} + \ln(K)/(e - 1)$, for which it is sufficient to show that

$$\tau \le 8 \ln(K) \eta_{m^*}^{1/\beta - 2} - 8(e - 2) C_K + 1 = 8 \ln(K) \phi^{(m^* - 1)(2 - 1/\beta)} - 8(e - 2) C_K + 1,$$

which is true by definition of $\tau$. □

Proof of Theorem 9. We will show that the conditions of Lemma 8 (with the same $\alpha$, and $\beta = 1$)
are satisfied with probability at least $1 - \delta$. For $r = \tau, \tau + 1, \ldots$, let $A_r^k$ denote the event that
$L_r^k(m^*) - \mathrm{E}[L_r^k(m^*)] > -\frac{\alpha}{2} r$ for $k \ne k^*$, let $B_r$ denote the event that $L_r^{k^*}(m^*) - \mathrm{E}[L_r^{k^*}(m^*)] <
\frac{\alpha}{2} r$, and let $D_r = B_r \cap \bigcap_{k \ne k^*} A_r^k$ denote the intersection of these events. Using (7), it can be seen
that $L_r^k(m^*) - L_r^{k^*}(m^*) > \alpha r$ on $D_r$, as required by (6). Hence we need to show that the probability of
$\bigcap_{r \ge \tau} D_r$ is at least $1 - \delta$, or equivalently that the probability of the complementary event $\bigcup_{r \ge \tau} \bar{D}_r$
is at most $\delta$.

By Hoeffding's inequality [4] the probabilities of the complementary events $\bar{A}_r^k$ and $\bar{B}_r$ may each
be bounded by $\exp(-\frac{1}{2} \alpha^2 r)$. And hence by the union bound the probability of $\bar{D}_r$ is bounded by
$K \exp(-\frac{1}{2} \alpha^2 r)$. Again by the union bound it follows that

$$\Pr\Big( \bigcup_{r \ge \tau} \bar{D}_r \Big) \le \sum_{r=\tau}^\infty K e^{-\frac{1}{2} \alpha^2 r} \le K \int_{\tau - 1}^\infty e^{-\frac{1}{2} \alpha^2 r} \, dr = \frac{2K}{\alpha^2} e^{-\frac{1}{2} \alpha^2 (\tau - 1)}.$$

We require this probability to be bounded by $\delta$, for which it is sufficient that

$$\tau \ge \frac{2}{\alpha^2} \ln\Big( \frac{2K}{\alpha^2 \delta} \Big) + 1. \tag{10}$$

By definition of $\tau$, this is implied by

$$8 \ln(K) \phi^{m^* - 1} - \frac{8(e - 2)(K - 1)}{\alpha} \ge \frac{2}{\alpha^2} \ln\Big( \frac{2K}{\alpha^2 \delta} \Big) + 1,$$

which holds for our choice (8) of $m^*$. To show that AdaHedge starts at most $m^*$ segments, it
remains to verify the other condition of Lemma 8, which is that $\tau \ge 1$. This follows from (10) upon
observing that (7) implies $\alpha \le \frac{1}{2}$ so that $\frac{2}{\alpha^2} \ln( \frac{2K}{\alpha^2 \delta} ) > 0$.

Finally, the bound on the regret is obtained by plugging $m^*$ into Lemma 4. □

2 Lemma 7 applies to a single run of the Hedge algorithm. As AdaHedge restarts Hedge at the start of every
new segment, the times in Lemma 7 should be interpreted relative to the current segment, $m^*$.
Proof of Lemma 10. Within this proof, let us drop the subscript t from $\ell_t$ and $w_t$, and define the
function $f_k(\ell) = e^{-\eta \ell^k}$ for every action k. Let $\ell_0$ and $\ell_1$ be arbitrary loss vectors, and let $\lambda \in (0, 1)$
also be arbitrary. Then it is sufficient to show that

$$\ln \mathrm{E}_{k \sim w}[f_k(\ell_\lambda)] \le (1 - \lambda) \ln \mathrm{E}_{k \sim w}[f_k(\ell_0)] + \lambda \ln \mathrm{E}_{k \sim w}[f_k(\ell_1)], \tag{11}$$

where $\ell_\lambda = (1 - \lambda) \ell_0 + \lambda \ell_1$. Towards this end, we start by observing that $f_k$ is log-convex:

$$\ln f_k(\ell_\lambda) \le (1 - \lambda) \ln f_k(\ell_0) + \lambda \ln f_k(\ell_1). \tag{12}$$

Inequality (11) now follows from the general fact that a convex combination of log-convex functions
is itself log-convex, which we will proceed to prove: using first (12) and then applying Hölder's
inequality (see e.g. [11]) one obtains

$$\mathrm{E}_{k \sim w}[f_k(\ell_\lambda)] \le \mathrm{E}_{k \sim w}\big[f_k(\ell_0)^{1 - \lambda} f_k(\ell_1)^{\lambda}\big] \le \mathrm{E}_{k \sim w}[f_k(\ell_0)]^{1 - \lambda} \, \mathrm{E}_{k \sim w}[f_k(\ell_1)]^{\lambda},$$

from which (11) follows by taking natural logarithms. □